NSF PAR Search | NSF Public Access Repository

Semisynthetic simulation for microbiome data analysis

https://doi.org/10.1093/bib/bbaf051

Sankaran, Kris; Kodikara, Saritha; Li, Jingyi_Jessica; Cao, Kim-Anh_Lê (February 2025, Briefings in Bioinformatics)

Abstract High-throughput sequencing data lie at the heart of modern microbiome research. Effective analysis of these data requires careful preprocessing, modeling, and interpretation to detect subtle signals and avoid spurious associations. In this review, we discuss how simulation can serve as a sandbox to test candidate approaches, creating a setting that mimics real data while providing ground truth. This is particularly valuable for power analysis, methods benchmarking, and reliability analysis. We explain the probability, multivariate analysis, and regression concepts behind modern simulators and how different implementations make trade-offs between generality, faithfulness, and controllability. Recognizing that all simulators only approximate reality, we review methods to evaluate how accurately they reflect key properties. We also present case studies demonstrating the value of simulation in differential abundance testing, dimensionality reduction, network analysis, and data integration. Code for these examples is available in an online tutorial (https://go.wisc.edu/8994yz) that can be easily adapted to new problem settings.
Categorization of 34 computational methods to detect spatially variable genes from spatially resolved transcriptomics data

https://doi.org/10.1038/s41467-025-56080-w

Yan, Guanao; Hua, Shuo_Harper; Li, Jingyi_Jessica (January 2025, Nature Communications)
Exaggerated false positives by popular differential expression methods when analyzing human population samples

https://doi.org/10.1186/s13059-022-02648-4

Li, Yumei; Ge, Xinzhou; Peng, Fanglue; Li, Wei; Li, Jingyi_Jessica (March 2022, Genome Biology)

Abstract When identifying differentially expressed genes between two conditions using human population RNA-seq samples, we found by permutation analysis that two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates (FDRs). Expanding the analysis to limma-voom, NOISeq, dearseq, and the Wilcoxon rank-sum test, we found that FDR control often fails for every method except the Wilcoxon rank-sum test. In particular, the actual FDRs of DESeq2 and edgeR sometimes exceed 20% when the target FDR is 5%. Based on these results, for population-level RNA-seq studies with large sample sizes, we recommend the Wilcoxon rank-sum test.
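The recommended workflow can be sketched roughly as follows: a per-gene Wilcoxon rank-sum (Mann-Whitney U) test, a Benjamini-Hochberg adjustment, and a permuted-label run as a sanity check. This is a minimal illustration on toy Gaussian data, not the authors' pipeline; the function names `benjamini_hochberg` and `wilcoxon_de`, the simulated matrix, and the choice of Benjamini-Hochberg as the multiple-testing correction are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


def wilcoxon_de(expr, is_case, target_fdr=0.05):
    """Hypothetical helper: expr is genes x samples, is_case is a boolean
    vector over samples. Returns adjusted p-values and a boolean call mask."""
    expr = np.asarray(expr, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    pvals = np.array([
        mannwhitneyu(row[is_case], row[~is_case], alternative="two-sided").pvalue
        for row in expr
    ])
    adj = benjamini_hochberg(pvals)
    return adj, adj <= target_fdr


# Toy data (an assumption for illustration, not real RNA-seq counts).
rng = np.random.default_rng(0)
n_genes, n_per_group = 200, 50
expr = rng.normal(size=(n_genes, 2 * n_per_group))
is_case = np.arange(2 * n_per_group) >= n_per_group
expr[:10, is_case] += 2.0  # the first 10 genes carry a true shift

adj, called = wilcoxon_de(expr, is_case)

# Permutation check: shuffling the condition labels breaks any real
# association, so calls made on permuted labels are false positives.
permuted_labels = rng.permutation(is_case)
_, called_permuted = wilcoxon_de(expr, permuted_labels)
print("calls on real labels:", int(called.sum()))
print("calls on permuted labels:", int(called_permuted.sum()))
```

A method that reports many calls on permuted labels is not controlling its error rate on that dataset, which is the kind of check the abstract describes.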
scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data

https://doi.org/10.1093/bioinformatics/btac271

Song, Dongyuan; Xi, Nan_Miles; Li, Jingyi_Jessica; Wang, Lin; Vitek, Olga (ed.) (April 2022, Bioinformatics)

Abstract Summary: The number of cells measured in single-cell transcriptomic data has grown fast in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, simple random subsampling is not ideal for preserving rare cell types. Therefore, diversity-preserving subsampling is required for fast exploration of cell types in a large-scale dataset. Here, we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. Availability and implementation: scSampler is implemented in Python and is published under the MIT source license. It can be installed by “pip install scsampler” and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler. An R interface is available at: https://github.com/SONGDONGYUAN1994/rscsampler. Supplementary information: Supplementary data are available at Bioinformatics online.
mbImpute: an accurate and robust imputation method for microbiome data

https://doi.org/10.1186/s13059-021-02400-4

Jiang, Ruochen; Li, Wei_Vivian; Li, Jingyi_Jessica (June 2021, Genome Biology)

Abstract A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data—mbImpute—to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. We demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and mbImpute preserves non-zero distributions of taxa abundances.
Clipper: p-value-free FDR control on high-throughput data from two conditions

https://doi.org/10.1186/s13059-021-02506-9

Ge, Xinzhou; Chen, Yiling_Elaine; Song, Dongyuan; McDermott, MeiLu; Woyshner, Kyla; Manousopoulou, Antigoni; Wang, Ning; Li, Wei; Wang, Leo_D; Li, Jingyi_Jessica (October 2021, Genome Biology)

Abstract High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
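To give a feel for what p-value-free FDR control can look like, the sketch below uses one generic strategy: compute a per-feature contrast score (here, simply the difference between the two conditions) and pick the smallest cutoff at which the negative tail, used as an estimate of false positives, is small relative to the positive tail. The abstract does not describe Clipper's procedure, so this should not be read as Clipper's implementation; the function name `contrast_threshold`, the difference-based score, and the toy data are assumptions made for illustration.

```python
import numpy as np


def contrast_threshold(contrast, q=0.05):
    """Hypothetical helper: smallest cutoff t such that
    (1 + #{d <= -t}) / max(#{d >= t}, 1) <= q, or inf if none exists."""
    d = np.asarray(contrast, dtype=float)
    candidates = np.unique(np.abs(d[d != 0]))  # np.unique returns sorted values
    for t in candidates:
        # The negative tail stands in for the number of false positives,
        # assuming null contrast scores are symmetric around zero.
        fdp_estimate = (1 + np.sum(d <= -t)) / max(np.sum(d >= t), 1)
        if fdp_estimate <= q:
            return float(t)
    return float("inf")


# Toy data (an assumption): one measurement per feature in each condition.
rng = np.random.default_rng(1)
n_features = 2000
cond_a = rng.normal(size=n_features)
cond_b = rng.normal(size=n_features)
cond_a[:200] += 6.0  # the first 200 features are truly enriched in condition A

d = cond_a - cond_b
t = contrast_threshold(d, q=0.05)
selected = np.flatnonzero(d >= t)
print("cutoff:", t, "features selected:", selected.size)
```

No distributional model or p-value enters the calculation; the only assumption is that uninteresting features are equally likely to score positive or negative.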